    Improving Academic Plagiarism Detection for STEM Documents by Analyzing Mathematical Content and Citations

    Identifying academic plagiarism is a pressing task for educational and research institutions, publishers, and funding agencies. Current plagiarism detection systems reliably find instances of copied and moderately reworded text. However, reliably detecting concealed plagiarism, such as strong paraphrases, translations, and the reuse of nontextual content and ideas, is an open research problem. In this paper, we extend our prior research on analyzing mathematical content and academic citations. Both are promising approaches for improving the detection of concealed academic plagiarism primarily in Science, Technology, Engineering and Mathematics (STEM). We make the following contributions: i) We present a two-stage detection process that combines similarity assessments of mathematical content, academic citations, and text. ii) We introduce new similarity measures that consider the order of mathematical features and outperform the measures in our prior research. iii) We compare the effectiveness of the math-based, citation-based, and text-based detection approaches using confirmed cases of academic plagiarism. iv) We demonstrate that the combined analysis of math-based and citation-based content features allows identifying potentially suspicious cases in a collection of 102K STEM documents. Overall, we show that analyzing the similarity of mathematical content and academic citations is a striking supplement for conventional text-based detection approaches for academic literature in the STEM disciplines. Comment: Proceedings of the ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL) 2019. The data and code of our study are openly available at https://purl.org/hybridP

    Failing to hash into supersingular isogeny graphs

    An important open problem in supersingular isogeny-based cryptography is to produce, without a trusted authority, concrete examples of "hard supersingular curves", that is, equations for supersingular curves for which computing the endomorphism ring is as difficult as it is for random supersingular curves. A related open problem is to produce a hash function to the vertices of the supersingular $\ell$-isogeny graph which does not reveal the endomorphism ring, or a path to a curve of known endomorphism ring. Such a hash function would open up interesting cryptographic applications. In this paper, we document a number of (thus far) failed attempts to solve this problem, in the hope that we may spur further research, and shed light on the challenges and obstacles to this endeavour. The mathematical approaches contained in this article include: (i) iterative root-finding for the supersingular polynomial; (ii) gcd's of specialized modular polynomials; (iii) using division polynomials to create small systems of equations; (iv) taking random walks in the isogeny graph of abelian surfaces; and (v) using quantum random walks. Comment: 33 pages, 7 figures

    Finishing the euchromatic sequence of the human genome

    The sequence of the human genome encodes the genetic instructions for human physiology, as well as rich information about human evolution. In 2001, the International Human Genome Sequencing Consortium reported a draft sequence of the euchromatic portion of the human genome. Since then, the international collaboration has worked to convert this draft into a genome sequence with high accuracy and nearly complete coverage. Here, we report the result of this finishing process. The current genome sequence (Build 35) contains 2.85 billion nucleotides interrupted by only 341 gaps. It covers ∼99% of the euchromatic genome and is accurate to an error rate of ∼1 event per 100,000 bases. Many of the remaining euchromatic gaps are associated with segmental duplications and will require focused work with new methods. The near-complete sequence, the first for a vertebrate, greatly improves the precision of biological analyses of the human genome including studies of gene number, birth and death. Notably, the human genome seems to encode only 20,000-25,000 protein-coding genes. The genome sequence reported here should serve as a firm foundation for biomedical research in the decades ahead.

    World Congress Integrative Medicine & Health 2017: Part one

    The evolving SARS-CoV-2 epidemic in Africa: Insights from rapidly expanding genomic surveillance

    INTRODUCTION Investment in Africa over the past year with regard to severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) sequencing has led to a massive increase in the number of sequences, which, to date, exceeds 100,000 sequences generated to track the pandemic on the continent. These sequences have profoundly affected how public health officials in Africa have navigated the COVID-19 pandemic. RATIONALE We demonstrate how the first 100,000 SARS-CoV-2 sequences from Africa have helped monitor the epidemic on the continent, how genomic surveillance expanded over the course of the pandemic, and how we adapted our sequencing methods to deal with an evolving virus. Finally, we also examine how viral lineages have spread across the continent in a phylogeographic framework to gain insights into the underlying temporal and spatial transmission dynamics for several variants of concern (VOCs). RESULTS Our results indicate that the number of countries in Africa that can sequence the virus within their own borders is growing and that this is coupled with a shorter turnaround time from the time of sampling to sequence submission. Ongoing evolution necessitated the continual updating of primer sets, and, as a result, eight primer sets were designed in tandem with viral evolution and used to ensure effective sequencing of the virus. The pandemic unfolded through multiple waves of infection that were each driven by distinct genetic lineages, with B.1-like ancestral strains associated with the first pandemic wave of infections in 2020. Successive waves on the continent were fueled by different VOCs, with Alpha and Beta cocirculating in distinct spatial patterns during the second wave and Delta and Omicron affecting the whole continent during the third and fourth waves, respectively. 
Phylogeographic reconstruction points toward distinct differences in viral importation and exportation patterns associated with the Alpha, Beta, Delta, and Omicron variants and subvariants, when considering both Africa versus the rest of the world and viral dissemination within the continent. Our epidemiological and phylogenetic inferences therefore underscore the heterogeneous nature of the pandemic on the continent and highlight key insights and challenges, for instance, recognizing the limitations of low testing proportions. We also highlight the early warning capacity that genomic surveillance in Africa has had for the rest of the world with the detection of new lineages and variants, the most recent being the characterization of various Omicron subvariants. CONCLUSION Sustained investment for diagnostics and genomic surveillance in Africa is needed as the virus continues to evolve. This is important not only to help combat SARS-CoV-2 on the continent but also because it can be used as a platform to help address the many emerging and reemerging infectious disease threats in Africa. In particular, capacity building for local sequencing within countries or within the continent should be prioritized because this is generally associated with shorter turnaround times, providing the most benefit to local public health authorities tasked with pandemic response and mitigation and allowing for the fastest reaction to localized outbreaks. These investments are crucial for pandemic preparedness and response and will serve the health of the continent well into the 21st century

    Data-driven estimation of the non-prompt background in same charged $W^\pm W^\pm$ scattering within the ATLAS experiment

    The scattering of vector bosons (VBS) offers a unique opportunity to study the electroweak sector of the Standard Model, the Higgs mechanism and furthermore physics beyond the Standard Model. A very promising channel to investigate vector boson scattering at the LHC is the scattering of same charged $W^\pm$ bosons due to its comparatively high cross-section of the electroweak processes. The second largest background of this $W^\pm W^\pm jj$-EW signal originates from misidentified leptons (non-prompt leptons). This thesis aims to improve the data-driven method used so far to estimate this non-prompt background. In order to avoid large extrapolation factors from the control region used for the data-driven method to the signal region, a new control region using dilepton events is defined. This dilepton control region is kinematically closer to the signal region than the dijet control region used in the previous publications. The data-driven method is adapted to the dilepton control region and thoroughly tested with Monte Carlo simulated events. The data studied in this thesis were measured with the ATLAS experiment at a collision energy of 13 TeV with an integrated luminosity of 138.7 $\text{fb}^{-1}$. Since the signal region of the analysis is blinded, the validity of the data-driven method is proven by using the low dijet invariant mass validation region, which is kinematically very close to the signal region. The data in this validation region is sufficiently well modeled by the sum of the data-driven estimated non-prompt background and the prompt and charge flip contribution estimated by Monte Carlo simulations. Therefore, the data-driven estimation of non-prompt background described in this thesis is expected to have only a small extrapolation to the signal region and thus provides a valuable contribution to the current $W^\pm W^\pm jj$-EW analysis.

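    Studien zur Entfaltung kinematischer Verteilungen in der elektroschwachen Streuung von W- und Z-Bosonen am LHC MK [Studies on the unfolding of kinematic distributions in the electroweak scattering of W and Z bosons at the LHC]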

    For the comparison between kinematic distributions in WZ scattering measured by the ATLAS detector and other measurements or theory predictions, it is necessary to consider the impact of detector effects such as limited efficiency and resolution. One way to eliminate the detector impact on the data is unfolding. In this bachelor thesis, Monte Carlo data of the transverse mass $M_T$ (), the transverse momentum $M_T$ () and the number of jets in the electroweak scattering of W and Z bosons are unfolded. The unfolding process is studied at different integrated luminosities which can be partially reached in Run 2 of the LHC. First, a method for bin optimization is chosen in preparation for the unfolding. Different unfolding methods are used to calculate the unfolding result and its statistical and systematic uncertainties. Subsequently, the uncertainty caused by the unfolding process is determined. Finally, statistically fluctuated Monte Carlo pseudo-data are unfolded to gain an impression of how the unfolding results of real measured data might look. The unfolding procedure used in this thesis is kept general and could be applied to other data distributions.